Assuming we have a categorical independent variable (IV) and a categorical dependent variable (DV):
| iv | dv |
|---|---|
| HIGH | No |
| HIGH | No |
| LOW | No |
| HIGH | Yes |
| HIGH | Yes |
| HIGH | Yes |
| HIGH | Yes |
| LOW | Yes |
| LOW | Yes |
| LOW | Yes |
Start by calculating the number of observations with each value of each category:
Columns: iv
| dv | LOW | HIGH |
|---|---|---|
| No | 1 | 2 |
| Yes | 3 | 4 |
| Total | 4 | 6 |
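The counting step can be sketched in a few lines of plain Python. This is only an illustration of the idea: the variable names (`observations`, `cell_counts`, `iv_totals`) are made up here and are not part of the course materials.

```python
from collections import Counter

# The ten (iv, dv) observations from the example table.
observations = [
    ("HIGH", "No"), ("HIGH", "No"), ("LOW", "No"),
    ("HIGH", "Yes"), ("HIGH", "Yes"), ("HIGH", "Yes"), ("HIGH", "Yes"),
    ("LOW", "Yes"), ("LOW", "Yes"), ("LOW", "Yes"),
]

# Count each (iv, dv) combination, plus the total for each IV value.
cell_counts = Counter(observations)
iv_totals = Counter(iv for iv, _ in observations)

# Print one row per DV value, with the IV values as columns (LOW, HIGH).
for dv in ("No", "Yes"):
    print(dv, [cell_counts[(iv, dv)] for iv in ("LOW", "HIGH")])
print("Total", [iv_totals[iv] for iv in ("LOW", "HIGH")])
```

The printed rows should match the cross tab of counts: 1 and 2 for "No", 3 and 4 for "Yes", with column totals of 4 and 6.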
Then, calculate the proportion/percentage of observations among each value of the IV.
If the independent variable is in the columns, then the columns should sum to 100%.
If the independent variable is in the rows, then the rows should sum to 100%.
Columns: iv
| dv | LOW | HIGH |
|---|---|---|
| No | 1 (25%) | 2 (33%) |
| Yes | 3 (75%) | 4 (67%) |
| Total | 4 | 6 |
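As a rough sketch of the percentage step, assuming the IV is in the columns: divide each cell by its column total. The helper name `column_percentages` is hypothetical.

```python
# Cell counts from the cross tab, stored as counts[dv][iv].
counts = {
    "No": {"LOW": 1, "HIGH": 2},
    "Yes": {"LOW": 3, "HIGH": 4},
}

def column_percentages(table):
    """Percentage of each DV value within each IV column."""
    iv_values = list(next(iter(table.values())))
    totals = {iv: sum(row[iv] for row in table.values()) for iv in iv_values}
    return {
        dv: {iv: 100 * row[iv] / totals[iv] for iv in iv_values}
        for dv, row in table.items()
    }

col_pct = column_percentages(counts)
for dv, row in col_pct.items():
    print(dv, {iv: round(p) for iv, p in row.items()})
```

Each IV column sums to 100%, which is a quick way to confirm the percentages were calculated in the intended direction.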
Look at what happens to the DV at different values of the IV. If your variables are ordinal, you should be able to identify a direction of the effect.
The proportion of “Yes” values decreases as the IV goes from lower to higher, so this is a negative or inverse relationship.
Using a bar graph or line graph can make these relationships easier to spot:
Key rule: always calculate percentages or proportions by categories of the independent variable.
If one or both variables are interval-level, you can bin them in order to use them in a cross tab. For instance, you could separate an interval variable like age into a series of age ranges.
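A minimal sketch of binning, assuming some made-up ages and cut points. The bin edges and labels below are chosen only for illustration; in practice they should come from your theory or from conventions in the literature.

```python
import bisect

# Hypothetical ages and bin edges, chosen only for illustration.
ages = [18, 24, 35, 47, 52, 68, 81]
edges = [30, 45, 65]  # lower bounds of the second, third, and fourth bins
labels = ["18-29", "30-44", "45-64", "65+"]

def age_group(age):
    """Return the label of the bin that contains `age`."""
    return labels[bisect.bisect_right(edges, age)]

binned = [age_group(a) for a in ages]
print(binned)
```

Once every observation carries a category label, the binned variable can be used in a cross tab like any other ordinal variable.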
Hypothesis: in a comparison of individuals, independents are less likely to turn out to vote compared to people who support one party or another.
How should I calculate proportions here?
Columns: Party ID
| turnout2020 | Democrat | Independent | Republican |
|---|---|---|---|
| 0. Did not vote | 335 | 316 | 382 |
| 1. Voted | 3160 | 560 | 2714 |
Are these results generally consistent with my hypothesis?
Columns: Party ID
| turnout2020 | Democrat | Independent | Republican |
|---|---|---|---|
| 0. Did not vote | 335 (10%) | 316 (36%) | 382 (12%) |
| 1. Voted | 3160 (90%) | 560 (64%) | 2714 (88%) |
If we think of party ID as an ordered variable, this is a curvilinear relationship.
What happens if I calculate % among the values of the DV?
Here’s the relationship between education and voter turnout with % calculated on education level:
Columns: Education
| turnout2020 | 1. Less than high school credential | 2. High school credential | 3. Some post-high school, no bachelor's degree | 4. Bachelor's degree | 5. Graduate degree |
|---|---|---|---|---|---|
| 0. Did not vote | 130 (41%) | 286 (24%) | 380 (15%) | 135 (7%) | 91 (6%) |
| 1. Voted | 185 (59%) | 883 (76%) | 2148 (85%) | 1749 (93%) | 1388 (94%) |
Note: Column % in parentheses
The results suggest a positive or direct relationship: as education increases, so does the % turnout.
Here’s the relationship between education and voter turnout with % calculated across voter turnout:
Columns: Education
| turnout2020 | 1. Less than high school credential | 2. High school credential | 3. Some post-high school, no bachelor's degree | 4. Bachelor's degree | 5. Graduate degree |
|---|---|---|---|---|---|
| 0. Did not vote | 130 (13%) | 286 (28%) | 380 (37%) | 135 (13%) | 91 (9%) |
| 1. Voted | 185 (3%) | 883 (14%) | 2148 (34%) | 1749 (28%) | 1388 (22%) |
Note: Row % in parentheses
Here, the results can give the misleading impression that there’s a curvilinear relationship: turnout drops off for Bachelor’s Degrees and above.
Either of these tables might be a valid way to look at these data, but they answer slightly different questions:
If I want to compare turnout at different levels of education, then I need to calculate % turnout among people with different levels of education.
If I want to compare education among voters and non-voters, then I need to calculate % education among people who voted and didn’t vote.
Which variable is the IV and which is the DV is sometimes a theoretical question, but in this case it’s unlikely that voting causes people to become more educated, so it probably doesn’t make sense to calculate percentages by voting vs. non-voting.
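Both sets of percentages come from the same counts; only the denominator changes. A small sketch using the education and turnout counts from the tables, with illustrative variable names:

```python
# Counts from the education-by-turnout table (education levels 1 to 5).
did_not_vote = [130, 286, 380, 135, 91]
voted = [185, 883, 2148, 1749, 1388]

# Column %: share who voted within each education level (education as the IV).
turnout_by_education = [
    round(100 * v / (v + n)) for v, n in zip(voted, did_not_vote)
]

# Row %: share of all voters who fall in each education level.
education_among_voters = [round(100 * v / sum(voted)) for v in voted]

print(turnout_by_education)
print(education_among_voters)
```

The first list rises steadily with education, while the second peaks in the middle category simply because that category contains the most people.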
When we have an interval-level outcome and a categorical independent variable, we can group the observations by values of the IV and then calculate the mean of the outcome within each group.
For instance, I want to examine the relationship between national wealth and carbon emissions. My hypothesis is that wealthier nations will have more emissions compared to poorer nations.
| country | gdp.percap.5cat | co2.percap |
|---|---|---|
| Afghanistan | 1. $3k or less | 0.281803 |
| Albania | 3. $10k to $25k | 1.936486 |
| Algeria | 3. $10k to $25k | 3.988271 |
| Angola | 2. $3k to $10k | 1.194668 |
| Argentina | 3. $10k to $25k | 3.995881 |
| Armenia | 3. $10k to $25k | 2.030401 |
| Australia | 5. $45k or more | 16.308205 |
| Austria | 5. $45k or more | 7.648816 |
| Azerbaijan | 3. $10k to $25k | 3.962984 |
| Bahrain | 5. $45k or more | 20.934996 |
GDP data has been grouped into five categories, so now I just need to calculate the average of CO2 emissions within each group of the ordinal IV:
| GDP Per capita range | CO2 emissions per capita |
|---|---|
| 1. $3k or less | 0.3128312 |
| 2. $3k to $10k | 1.2680574 |
| 3. $10k to $25k | 4.4065669 |
| 4. $25k to $45k | 8.0307610 |
| 5. $45k or more | 12.3134306 |
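The group-and-average step can be sketched as follows. Note the assumption: this uses only the ten example rows listed earlier, so the means will not match the full-data table, and the "$25k to $45k" category does not appear at all.

```python
from collections import defaultdict
from statistics import mean

# The ten example rows: (country, GDP category, CO2 per capita).
rows = [
    ("Afghanistan", "1. $3k or less", 0.281803),
    ("Albania", "3. $10k to $25k", 1.936486),
    ("Algeria", "3. $10k to $25k", 3.988271),
    ("Angola", "2. $3k to $10k", 1.194668),
    ("Argentina", "3. $10k to $25k", 3.995881),
    ("Armenia", "3. $10k to $25k", 2.030401),
    ("Australia", "5. $45k or more", 16.308205),
    ("Austria", "5. $45k or more", 7.648816),
    ("Azerbaijan", "3. $10k to $25k", 3.962984),
    ("Bahrain", "5. $45k or more", 20.934996),
]

# Group the outcome by values of the IV, then average within each group.
by_group = defaultdict(list)
for _, gdp_cat, co2 in rows:
    by_group[gdp_cat].append(co2)

group_means = {cat: mean(vals) for cat, vals in sorted(by_group.items())}
for cat, m in group_means.items():
    print(f"{cat}: {m:.3f}")
```

Even in this small subset, average emissions rise with each step up in GDP category.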
Is this generally consistent with expectations?
Here again, the relationship can be easier to conceptualize if we plot it.
A relationship like this will rarely be perfectly straight, so “linearity” and “curvilinearity” are partly a matter of degree, but there are some cases where there is a clear “U” shape to the relationship:
| iv | dv |
|---|---|
| 1. Extremely liberal | 6.314 |
| 2. Liberal | 5.685 |
| 3. Slightly liberal | 5.001 |
| 4. Moderate; middle of the road | 4.651 |
| 5. Slightly conservative | 4.636 |
| 6. Conservative | 4.974 |
| 7. Extremely conservative | 5.363 |
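One rough way to check for a "U" shape is to look at whether the group means fall to a low point somewhere in the middle and then rise again. The rule below is a hypothetical rule of thumb for illustration, not a formal test.

```python
# Group means from the table above, ordered from category 1 to 7.
means = [6.314, 5.685, 5.001, 4.651, 4.636, 4.974, 5.363]

# Change between each adjacent pair of categories.
steps = [b - a for a, b in zip(means, means[1:])]

# Rule of thumb (an assumption): call the pattern U-shaped if the means
# fall to an interior minimum and rise at every step afterwards.
turn = means.index(min(means))
is_u_shaped = (
    0 < turn < len(means) - 1
    and all(s < 0 for s in steps[:turn])
    and all(s > 0 for s in steps[turn:])
)

print("lowest category:", turn + 1)
print("U-shaped:", is_u_shaped)
```

Here the low point is category 5, with the means declining before it and increasing after it.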
How can we distinguish correlation from causation?
This process inevitably requires us to consider rival explanations for an observed relationship:
What I want to show is that Fox News viewership causes a decreased chance of getting a Covid vaccine.
There’s a correlation, but I’m concerned this relationship is spurious because I know that things like existing political views are already correlated with media consumption, and those might explain any correlation I see here:
It’s possible that this difference in ideology accounts for the entire observed correlation between media habits and vaccines. I can’t really rule this possibility out without further investigation.
What if I could randomly assign people to watch Fox News? Random assignment would ensure that nothing is correlated with Fox News viewership.
Ideology may still matter for getting a vaccine, but if conservatism is randomly distributed between Fox News viewers and non-viewers, it no longer confounds the observed relationship.
Experiments use random assignment to account for rival explanations. If you randomly assign people to receive a “treatment”, then you can ensure that there is no confounding because nothing is correlated with your IV.
The classic examples are in medicine:
Group A is randomly assigned to receive a placebo (the control group)
Group B is randomly assigned to receive a medicine (the treatment group)
After a certain period of time, we compare the outcomes for both groups.
Differences between the groups can be attributed to the effect of the treatment (+/- some random sampling error)
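A small simulation can illustrate why random assignment matters. Everything here is invented for the sketch: the true effect (`TRUE_EFFECT`), the sample size, and the probabilities are assumptions, not estimates from real data.

```python
import random

random.seed(42)
TRUE_EFFECT = -0.10  # assumed effect of treatment on the outcome probability
N = 20_000           # assumed number of simulated people per scenario

def simulate(randomized):
    """Return the treated-minus-control difference in outcome rates."""
    treated, control = [], []
    for _ in range(N):
        confounder = random.random() < 0.5  # e.g. a prior political view
        if randomized:
            # Coin flip: treatment is unrelated to the confounder.
            gets_treatment = random.random() < 0.5
        else:
            # Self-selection: the confounder makes treatment far more likely.
            gets_treatment = random.random() < (0.8 if confounder else 0.2)
        # The outcome depends on both the confounder and the treatment.
        p_outcome = 0.7 - 0.3 * confounder + TRUE_EFFECT * gets_treatment
        outcome = random.random() < p_outcome
        (treated if gets_treatment else control).append(outcome)
    return sum(treated) / len(treated) - sum(control) / len(control)

observational = simulate(randomized=False)
experimental = simulate(randomized=True)
print(f"observational difference: {observational:+.3f}")
print(f"experimental difference:  {experimental:+.3f}")
```

Under these assumptions, the experimental difference should land close to the true effect, while the self-selected comparison overstates it because the confounder is unevenly distributed across the two groups.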
Experiments are considered a “gold standard” because they can account for all kinds of confounding, including confounding caused by unobserved or unexpected relationships.
However, they have two key limitations:
External validity: results in the lab may not easily translate to results in real life.
Feasibility: many interesting questions involve things that just can’t be randomly assigned. We can’t assign “democracy” or “war” or “religion” to people.
Field experiments can lessen the external validity problem by using random assignment in the field.
For instance, one common way to study GOTV messaging is to randomly select households to receive mailers:
Civic duty treatment
Hawthorne treatment
Neighbors treatment
Self treatment
From: GERBER, A. S., GREEN, D. P., & LARIMER, C. W. (2008). Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment. American Political Science Review, 102(1), 33–48. doi:10.1017/S000305540808009X
Field experiments can face fewer external validity problems, but some things still can’t be experimentally manipulated.
Natural experiments use “quasi” randomization or “randomization by nature” where treatments are assigned more-or-less randomly.
Viewing Fox News isn’t random, but areas where Fox News is lower in the channel order will have more viewers.
Channel order is essentially randomly assigned.
So, using channel order as a “treatment” assignment might theoretically allow us to account for confounding in an observational setting.
Other sources of quasi randomization include:
Lotteries (like the Vietnam Draft, or the literal lottery)
Arbitrary cutoffs (barely winning an election vs. barely losing)
Natural disasters and weather events
Still, natural experiments require a mixture of creativity and luck. They’re not available for most questions.
Observational research doesn’t allow us to easily account for rival explanations. So how can we cope?